1. Introduction

In this project, we explore the physicochemical properties of white wine and their relation to its quality. A text file with more information regarding the physicochemical variables is located here.

2. Dataset Exploration

First, we will load the dataset into R and examine its features.

## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000
  • There are 4898 observations, each consisting of 1 unique identifier, 11 input variables, and 1 output variable.
  • No wine obtained a perfect score of 10.
  • free.sulfur.dioxide and total.sulfur.dioxide are discrete whereas all other input variables are continuous.
  • All variables except quality have a mean above their median, which suggests positive-skewed distributions.
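
The mean-versus-median comparison behind the last observation can be checked directly. The sketch below is only an illustration: it copies the Mean and Median rows from the summary output above into a small data frame (the names `stats` and `mean.above.median` are made up for this example) and flags the variables whose mean exceeds the median.

```r
# Mean and median values copied from the summary() output above.
stats <- data.frame(
  variable = c("fixed.acidity", "volatile.acidity", "citric.acid",
               "residual.sugar", "chlorides", "free.sulfur.dioxide",
               "total.sulfur.dioxide", "density", "pH", "sulphates",
               "alcohol", "quality"),
  mean   = c(6.855, 0.2782, 0.3342, 6.391, 0.04577, 35.31,
             138.4, 0.9940, 3.188, 0.4898, 10.51, 5.878),
  median = c(6.800, 0.2600, 0.3200, 5.200, 0.04300, 34.00,
             134.0, 0.9937, 3.180, 0.4700, 10.40, 6.000)
)

# A mean above the median hints at a longer right tail (positive skew).
# This is a rough heuristic, not a formal test of skewness.
stats$mean.above.median <- stats$mean > stats$median
print(stats)
```

With the full dataset loaded, the same comparison could be made per column with `mean()` and `median()` instead of copied values.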

3. Univariate Plots

  • It seems that quality is normally distributed, with 6 as the most frequently occurring value.
  • The majority of our input variables contain outliers that need attention.

4. Univariate Analysis


    4a. Chemical Properties

    It seems that most of our input variables, such as chlorides and residual.sugar, contain outliers. I decided to rescale the axes and determine whether each distribution is normal or skewed. Some observations on the distributions of the chemical properties can be made:

  • Normal: pH
  • Normal + Outliers: total.sulfur.dioxide, citric.acid, fixed.acidity, density
  • Positive-Skewed : alcohol, sulphates
  • Positive-Skewed + Outliers : residual.sugar, free.sulfur.dioxide, chlorides, volatile.acidity
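
The source does not say how the axes were rescaled. One common option, shown here purely as an assumption, is to limit the axis to a central quantile range so that a few extreme values do not dominate the histogram. The helper name `central_limits` and the 1%/99% cutoffs are hypothetical.

```r
# Hypothetical helper: axis limits that keep the central bulk of the data.
# The 1% and 99% cutoffs are an assumption, not the author's choice.
central_limits <- function(x, lower = 0.01, upper = 0.99) {
  unname(quantile(x, probs = c(lower, upper), na.rm = TRUE))
}

# First ten residual.sugar values from the str() output, for illustration only.
residual.sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)

lims <- central_limits(residual.sugar)
lims

# In a ggplot2 histogram these limits could be applied with
# coord_cartesian(xlim = lims), which zooms without dropping observations.
```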


    4b. Feature Transformation

    get_ori_plot <- function(var, label) {
      ggplot(aes(x = (var)), data = df) + xlab(label) +
      geom_histogram(colour = "black",
                     fill = '#dbdd46',
                     bins = 30)
    }
    
    get_sqrt_plot <- function(var, label) {
      ggplot(aes(x = sqrt(var)), data = df) +  xlab(label) +
      geom_histogram(colour = "black",
                     fill = '#dbdd46',
                     bins = 30)
    }
    
    get_log_plot <- function(var, label) {
      ggplot(aes(x = log10(var)), data = df) + xlab(label) +
      geom_histogram(colour = "black",
                     fill = '#dbdd46',
                     bins = 30)
    }

    I decided to apply these transformations to the positive-skewed features to determine whether they would display a normal distribution afterwards. An example is shown below:

  • The base-10 logarithmic transformation of sulphates displayed the preferred normal distribution, so I created a new variable for that transformation.
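
As a rough numerical companion to the visual check, the sketch below computes a moment-based sample skewness before and after the base-10 log transformation and stores the transformed values in a new column. The helper `skewness` and the column name `log10.sulphates` are assumptions made for this example, and the ten sulphates values are just the first rows shown in the str() output, so the numbers are not representative of the full dataset.

```r
# Hypothetical helper: moment-based sample skewness.
# Values near 0 suggest symmetry; positive values suggest a longer right tail.
skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# First ten sulphates values from the str() output; illustration only.
df <- data.frame(
  sulphates = c(0.45, 0.49, 0.44, 0.4, 0.4, 0.44, 0.47, 0.45, 0.49, 0.45)
)

# The new column name is an assumption, not necessarily the author's.
df$log10.sulphates <- log10(df$sulphates)

c(raw = skewness(df$sulphates), log10 = skewness(df$log10.sulphates))
```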

    4c. rating

    It was shown that quality was normally distributed with 6 as the most frequent rating. I decided to create a categorical variable, rating, for future analysis with various features. In this project, we consider a quality of 3 - 4: Bad, 5 - 7: Average, and 8 - 9: Good.

    ## 
    ##     Bad Average    Good 
    ##     183    4535     180

  • We have roughly the same number of Bad and Good wines.
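
The binning described above can be sketched with base R's `cut()`. The function name `to_rating` is hypothetical; the break points follow the 3 - 4, 5 - 7, and 8 - 9 ranges stated in the text.

```r
# Hypothetical helper mapping quality scores to the three rating levels.
# cut() uses right-closed intervals by default: (2,4], (4,7], (7,9].
to_rating <- function(quality) {
  cut(quality,
      breaks = c(2, 4, 7, 9),
      labels = c("Bad", "Average", "Good"))
}

# Every quality score that occurs in the dataset (3 through 9).
rating <- to_rating(3:9)
table(rating)
```

Applied to the full dataset, this would be something like `df$rating <- to_rating(df$quality)` followed by `table(df$rating)`.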

    4d. free.to.bound

    After examining the structure of our dataset, I decided to examine the relations between variables, starting with our discrete input features: free.sulfur.dioxide and total.sulfur.dioxide. total.sulfur.dioxide is composed of free.sulfur.dioxide and bound.sulfur.dioxide, so the ratio of free to bound sulfur dioxide seemed more appropriate than the ratio of free.sulfur.dioxide to total.sulfur.dioxide.

    df$bound.sulfur.dioxide <-
      df$total.sulfur.dioxide - df$free.sulfur.dioxide
    df$free.to.bound <- df$free.sulfur.dioxide / df$bound.sulfur.dioxide


    Univariate Conclusion:

    We got a quick overview of the distribution of each feature in our dataset. Our main interest, quality, was normally distributed with 6 as the most frequently occurring value. All input features were continuous except the sulfur.dioxide features. Since some of our features were positive-skewed, we created functions for transforming a feature into a more nearly normal distribution. After researching total.sulfur.dioxide, we created a new feature, the ratio of free.sulfur.dioxide to bound.sulfur.dioxide, to analyze in later sections. We also saw that the majority of our features had outliers that would affect our future plots. Thus, rather than removing the outliers, we decided to rescale the plots to come.


    5. Bivariate Analysis


    5a. Correlation

    I decided to examine the relation between the output and the input variables through Pearson's r coefficient.

    ##                              [,1]
    ## fixed.acidity        -0.113662831
    ## volatile.acidity     -0.194722969
    ## citric.acid          -0.009209091
    ## residual.sugar       -0.097576829
    ## chlorides            -0.209934411
    ## free.sulfur.dioxide   0.008158067
    ## total.sulfur.dioxide -0.174737218
    ## density              -0.307123313
    ## pH                    0.099427246
    ## sulphates             0.053677877
    ## alcohol               0.435574715
    ## quality               1.000000000
    ## bound.sulfur.dioxide -0.217867760
    ## free.to.bound         0.164797933
    ##                       [,1]
    ## fixed.acidity        FALSE
    ## volatile.acidity     FALSE
    ## citric.acid          FALSE
    ## residual.sugar       FALSE
    ## chlorides             TRUE
    ## free.sulfur.dioxide  FALSE
    ## total.sulfur.dioxide FALSE
    ## density               TRUE
    ## pH                   FALSE
    ## sulphates            FALSE
    ## alcohol               TRUE
    ## quality               TRUE
    ## bound.sulfur.dioxide  TRUE
    ## free.to.bound        FALSE
    From the correlation matrix, we can also see notable correlations between the following pairs of input variables:
  • alcohol and density: -0.7801376
  • density and residual.sugar: 0.8389665
  • total.sulfur.dioxide and density: 0.5298813
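
The TRUE/FALSE table above is consistent with flagging coefficients whose absolute value exceeds 0.2. That cutoff is inferred from the printed output, not stated in the source, so treat it as an assumption. The sketch below applies it to the printed coefficients and also ranks the features by absolute correlation with quality.

```r
# Pearson coefficients with quality, copied from the output above.
r <- c(fixed.acidity        = -0.113662831,
       volatile.acidity     = -0.194722969,
       citric.acid          = -0.009209091,
       residual.sugar       = -0.097576829,
       chlorides            = -0.209934411,
       free.sulfur.dioxide  =  0.008158067,
       total.sulfur.dioxide = -0.174737218,
       density              = -0.307123313,
       pH                   =  0.099427246,
       sulphates            =  0.053677877,
       alcohol              =  0.435574715,
       quality              =  1.000000000,
       bound.sulfur.dioxide = -0.217867760,
       free.to.bound        =  0.164797933)

# Assumed cutoff of 0.2 on the absolute value (inferred, not from the source).
strong <- abs(r) > 0.2
names(r)[strong]

# Rank the input features by absolute correlation, excluding quality itself.
ranked <- sort(abs(r[names(r) != "quality"]), decreasing = TRUE)
head(names(ranked), 5)
```

Ranking by absolute value puts alcohol, density, bound.sulfur.dioxide, chlorides, and volatile.acidity at the top.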


    5b. quality Correlation

    I wanted to focus on 5 features that are highly correlated with quality:

  • alcohol
  • volatile.acidity
  • chlorides
  • density
  • bound.sulfur.dioxide

    6. Bivariate Plots

    Before plotting, I wanted to examine the transformation among the variables of interest:

    6a. Transforming bound.sulfur.dioxide

  • The square root of bound.sulfur.dioxide displayed a normal distribution.

    6b. Transforming alcohol

  • The square root of alcohol displayed a normal distribution.

    6c. Transforming chlorides

  • The original, untransformed chlorides displayed a more nearly normal distribution than the transformed versions.

    6d. Transforming volatile.acidity

  • The logarithm of volatile.acidity displayed a normal distribution.

    Since our main focus was on the discrete variable quality, I decided to use boxplots to explore the correlated features:

    # Function to make boxplots of yvar grouped by xvar.
    # Assumes ggplot2 is loaded and that df and a colour vector `colors`
    # are defined earlier in the analysis.
    make_boxplot <- function(xvar, yvar, title, xlabel, ylabel) {
      ggplot(df, aes(x = xvar, y = yvar, fill = xvar)) +
      geom_jitter(alpha = .3)  +
      geom_boxplot(alpha = 0.5, color = 'blue') +
      # Mark each group mean; `fun` replaces `fun.y`, deprecated in ggplot2 3.3.0
      stat_summary(fun = "mean", 
                   geom = "point", 
                   color = "red", 
                   shape = 8, 
                   size = 4) +
      scale_fill_manual(values = colors) +
      theme(legend.position = "none") +
      ggtitle(title) + theme(plot.title = element_text(hjust = 0.5)) +
      xlab(xlabel) + ylab(ylabel)
    }

    1. Alcohol

  • We can see that alcohol % increases with quality.

    2. Volatile Acidity

  • We can see that volatile.acidity decreases with quality and rating.

    3. Chlorides

  • We can see that chlorides decreases with quality and rating.

    4. Density

  • We can see that density decreases with quality and rating.

    5. Bound Sulfur Dioxide

  • We can see that bound.sulfur.dioxide decreases with quality and rating.

    Based on our boxplots, it seems that higher quality wines showed:

  • increasing alcohol
  • decreasing volatile.acidity
  • decreasing chlorides
  • decreasing density
  • decreasing bound.sulfur.dioxide
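
The trends read off the boxplots can also be checked numerically by comparing group medians across quality levels. The sketch below is only an illustration: `median_by_quality` is a hypothetical helper, and the toy values are made up to demonstrate the call, not taken from the dataset.

```r
# Hypothetical helper: median of a feature within each quality level.
median_by_quality <- function(values, quality) {
  vapply(split(values, quality), median, numeric(1))
}

# Toy values chosen only to demonstrate the call; not from the dataset.
toy <- data.frame(
  quality = c(5, 5, 6, 6, 7, 7),
  alcohol = c(9.4, 9.6, 10.0, 10.4, 11.2, 11.6)
)

med <- median_by_quality(toy$alcohol, toy$quality)
med

# Strictly increasing medians would match an upward trend in the boxplots.
all(diff(med) > 0)
```

On the real data the call would be something like `median_by_quality(df$alcohol, df$quality)`, repeated for each feature of interest.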


    (Addition: Examining free.to.bound)

    I also decided to explore the relation of free.sulfur.dioxide to bound.sulfur.dioxide:

  • A higher ratio of free.sulfur.dioxide to bound.sulfur.dioxide tends to appear in higher quality wines.

    Bivariate Conclusion:

    Initially, we created a correlation matrix to determine which features were correlated. We saw that the strongest overall correlation was 0.839, between density and residual.sugar. Our main interest, though, was to determine the features that were correlated with quality. The strongest correlation with quality was that of alcohol, at 0.436. After determining our main features, we categorized quality into rating: Bad (3-4), Average (5-7), Good (8-9) and generated boxplots to examine the differences across quality. We saw that higher alcohol and lower volatile acidity, chlorides, density, and bound.sulfur.dioxide appeared in higher quality wines.

    7. Multivariate Analysis

    In order to examine various features, I decided to create a correlation matrix based on the features that were highly correlated with quality and examine those features among each other.

    ##                      volatile.acidity chlorides density alcohol
    ## volatile.acidity                FALSE     FALSE   FALSE   FALSE
    ## chlorides                       FALSE     FALSE   FALSE    TRUE
    ## density                         FALSE     FALSE   FALSE    TRUE
    ## alcohol                         FALSE      TRUE    TRUE   FALSE
    ## bound.sulfur.dioxide            FALSE     FALSE    TRUE    TRUE
    ##                      bound.sulfur.dioxide
    ## volatile.acidity                    FALSE
    ## chlorides                           FALSE
    ## density                              TRUE
    ## alcohol                              TRUE
    ## bound.sulfur.dioxide                FALSE

    8. Multivariate Plots

    ## 
    ##     Bad Average    Good 
    ##     183    4535     180

    Since Average was accounting for 4535 observations, I decided to focus primarily on the Good and Bad wines.
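
Restricting the plots to the two extreme groups amounts to a simple subset on rating. A minimal sketch, assuming the rating column is a factor with levels Bad, Average, and Good; `keep_extremes` is a hypothetical helper and the toy rows are made up for the demonstration.

```r
# Hypothetical helper: keep only Bad and Good wines and drop the unused level.
keep_extremes <- function(data) {
  out <- data[data$rating %in% c("Bad", "Good"), , drop = FALSE]
  out$rating <- droplevels(out$rating)
  out
}

# Toy rows for demonstration only; not taken from the dataset.
toy <- data.frame(
  quality = c(3, 5, 6, 8, 9),
  rating  = factor(c("Bad", "Average", "Average", "Good", "Good"),
                   levels = c("Bad", "Average", "Good"))
)

extremes <- keep_extremes(toy)
table(extremes$rating)
```

Dropping the unused Average level keeps it out of plot legends and frequency tables.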

    7a. free.sulfur.dioxide vs. bound.sulfur.dioxide

  • Initially, I wanted to explore the features that make up total.sulfur.dioxide. It seems that, for higher quality wines, free.sulfur.dioxide tends to be between 25-50 mg/liter while bound.sulfur.dioxide tends to be between 50-100 mg/liter. Higher quality wines also show a trend of increasing free.sulfur.dioxide.

    7b. density vs. bound.sulfur.dioxide

  • Based on just Bad and Good quality wines, it seems that Good wines are concentrated in a narrower range than Bad wines. We can confirm that lower density and bound.sulfur.dioxide tend to appear in better wines.

    7c. alcohol vs. chlorides

  • We can also confirm that Good wines tend to be more concentrated than Bad wines. The plots show that higher alcohol and lower chlorides are associated with higher quality wines.

    7d. Linear Model

    The features used in the linear model were selected based on their correlation with quality:

    ## 
    ## Call:
    ## lm(formula = quality ~ volatile.acidity + alcohol + chlorides + 
    ##     bound.sulfur.dioxide + density, data = df)
    ## 
    ## Coefficients:
    ##          (Intercept)      volatile.acidity               alcohol  
    ##           -3.714e+01            -2.026e+00             3.880e-01  
    ##            chlorides  bound.sulfur.dioxide               density  
    ##           -1.278e+00            -3.398e-04             3.983e+01

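    We can see that the equation for quality depends heavily on the density of the wine, even though alcohol was the variable most highly correlated with quality. Note that each coefficient is on the scale of its own feature, and density only ranges from 0.9871 to 1.0390 in this dataset, so a large coefficient does not by itself imply a large effect. I speculate that density stands out because it is also relatively correlated with other features, such as bound.sulfur.dioxide and chlorides.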

    cor(df[,c(2:13,16)], df$density)
    ##                             [,1]
    ## fixed.acidity         0.26533101
    ## volatile.acidity      0.02711385
    ## citric.acid           0.14950257
    ## residual.sugar        0.83896645
    ## chlorides             0.25721132
    ## free.sulfur.dioxide   0.29421041
    ## total.sulfur.dioxide  0.52988132
    ## density               1.00000000
    ## pH                   -0.09359149
    ## sulphates             0.07449315
    ## alcohol              -0.78013762
    ## quality              -0.30712331
    ## free.to.bound        -0.07921315
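
Because the predictors are measured on very different scales, raw coefficients are hard to compare. One common remedy, which the original analysis does not use, is to express each coefficient as the change in the response per one standard deviation of the predictor. The sketch below uses simulated data with made-up column names (`narrow`, `wide`, `y`), and `standardized_effects` is a hypothetical helper.

```r
# Hypothetical helper: coefficient times predictor standard deviation,
# i.e. the change in the response per one-SD change in each predictor.
standardized_effects <- function(formula, data) {
  fit <- lm(formula, data = data)
  X <- model.matrix(fit)[, -1, drop = FALSE]  # drop the intercept column
  coef(fit)[-1] * apply(X, 2, sd)
}

# Simulated data, for illustration only: one predictor with a very narrow
# range (like density) and one with a wide range (like alcohol).
set.seed(1)
toy <- data.frame(
  narrow = rnorm(200, mean = 1, sd = 0.003),
  wide   = rnorm(200, mean = 10, sd = 1.2)
)
toy$y <- 40 * toy$narrow + 0.4 * toy$wide + rnorm(200, sd = 0.1)

raw <- coef(lm(y ~ narrow + wide, data = toy))[-1]
std <- standardized_effects(y ~ narrow + wide, toy)
raw
std
```

In this simulation the narrow-range predictor has the larger raw coefficient but the smaller per-standard-deviation effect. On the wine data, the analogous call would use the model formula shown above with `data = df`.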


    Multivariate Conclusion:

    We saw that the Average rating contained 4535 observations, so we focused on Good and Bad wines, which have about 180 observations each. For this analysis, we examined the relations among the features that are themselves correlated with quality. The most interesting observation was that Good wines tended to be concentrated in a narrower range than Bad wines. Using scatterplots, we were also able to confirm the boxplot trends seen in our bivariate analysis. With the linear model, we saw that density had the largest coefficient in the equation for quality.


    Final Plots and Summary


    * Plot One: Correlation Matrix


    Description One

    The correlation matrix allowed us to observe which features are most relevant to our main interest, quality. It also allowed us to explore, in the multivariate analysis, relations among variables beyond their correlation with quality itself. We were able to see that the feature most correlated with quality was alcohol, followed by density.


    * Plot Two: alcohol vs. quality


    Description Two

    We saw that alcohol had the highest correlation with quality, at 0.4355. Wines with higher scores tended to have a higher alcohol % by volume.


    * Plot Three: alcohol vs. chlorides


    Description Three

    In our multivariate analysis, we saw that Good wines tended to be concentrated in a narrower range than Bad wines, as in the plot above. We were also able to confirm that higher alcohol and lower chlorides tend to appear in Good wines, which was also observed in our bivariate analysis.

    Reflection

    The white wine dataset contained 4898 observations with 11 chemical properties. After exploring the dataset, we were able to determine the main factors associated with higher wine quality:

  • increasing alcohol
  • decreasing volatile.acidity
  • decreasing chlorides
  • decreasing density
  • decreasing bound.sulfur.dioxide

    However, quality is subjective and we cannot base it solely on physicochemical properties. There are other properties not included in the dataset that could play a bigger role in quality. Through our various plots, we were able to get a picture of how a wine is rated based solely on its physicochemical properties.

    Throughout the project, we saw that many outliers affected the initial distribution of the data and that not all features were normally distributed. Thus rescaling and transformation were necessary in later plots. In this project, we were able to transform the features appropriately to determine the best fit for our plots.

    If possible, more Bad and Good wine data would allow us to have a better understanding of the wine's quality.